Abstract

Single nucleotide polymorphisms (SNPs) often appear in clusters along the length of a chromosome. This is due to variation in local coalescent times caused by, for example, selection or recombination. Here we investigate whether recombination alone (within a neutral model) can cause statistically significant SNP clustering. We measure the extent of SNP clustering as the ratio between the variance of the number of SNPs found in bins of length $l$ and the mean number of SNPs in such bins, $\sigma_l^2/\mu_l$. For a uniform SNP distribution $\sigma_l^2/\mu_l = 1$; for clustered SNPs $\sigma_l^2/\mu_l > 1$. Apart from the bin length, three length scales are important when accounting for SNP clustering: the mean distance between neighboring SNPs, $\Delta$, the mean length of chromosome segments with constant time to the most recent common ancestor, $\ell_{\rm seg}$, and the total length of the chromosome, $L$. We show that SNP clustering is observed if $\Delta < \ell_{\rm seg} \ll L$. Moreover, if $l \ll \ell_{\rm seg} \ll L$, clustering becomes independent of the rate of recombination. We apply our results to the analysis of SNP data sets from mice, and human chromosomes 6 and X. Of the three data sets investigated, the human X chromosome displays the most significant deviation from neutrality.

Introduction 

Single nucleotide polymorphisms (SNPs) are the most abundant polymorphisms in most populations. Due to their ubiquity and stability they are useful in the diagnosis of human diseases (Zhou et al., 2002), detection of human disease genes (Willey et al., 2002), and gene mapping in organisms as diverse as humans (McInnes et al., 2001), Arabidopsis thaliana (Cho et al., 1999), and Drosophila (Berger et al., 2001). For this reason, several large-scale SNP-mapping projects are currently under way in eukaryotic model organisms including A. thaliana (http://arabidopsis.org/Cereon), Drosophila (Hoskins et al., 2001), mouse (Lindblad-Toh et al., 2000), and human (International Human Genome Sequencing Consortium, 2001; The International SNP Map Working Group, 2001).

A central question in the analysis of data collected in the context of these projects is how SNPs are distributed along a chromosome and what inferences about selection might be drawn from this distribution. This question can be addressed at the level of individual polymorphisms (Fay et al., 2001) or at the level of the whole genome (Lindblad-Toh et al., 2000).

Lindblad-Toh et al. (2000) have observed that SNPs cluster along chromosomes in mice. This clustering may be due either to variation in local mutation rates or to variation in local coalescent times. The hypothesis of local differences in mutation rates in mice was rejected, leaving differences in local coalescent times as the most likely explanation of SNP clustering. In the case of the mouse genome such variation in coalescent times may be due to selection in the wild or selection for unusual coat colors (cf. breeding of 'fancy' mice in the eighteenth and nineteenth centuries). Another possibility mentioned is the effect of inbreeding (Lindblad-Toh et al., 2000).

On the other hand, recombination alone leads to fluctuations in the time to the most recent common ancestor along a chromosome (Hudson, 1990). Since the time to the most recent common ancestor is proportional to the number of SNPs found in the respective chromosome segment, recombination in a neutral model might be sufficient to account for genome-wide SNP clustering.

A well-established stochastic model for neutral genetic variation is the constant-rate mutation coalescent process under the infinite-sites model. According to this model, the total number of SNPs found in a sample is expected to be Poisson distributed with parameter $\lambda = \theta T_{\rm tot}/2$, where $T_{\rm tot}$ is the total time to the most recent common ancestor, $\theta = 4N_e u$, $u$ is the probability of mutation per site per generation, and $N_e$ is the effective population size (see Hudson (1990) for a review). This is a global property of any contiguous stretch of DNA, and holds in the absence of recombination, where all sites have the same genealogy (and thus the total time $T_{\rm tot}$ to the most recent common ancestor is constant along the chromosome). In the presence of recombination, the number of polymorphisms conditional on the genealogies of all sites is still Poisson distributed with parameter

$$\lambda = \frac{\theta}{2} \int_0^L dx\, T_{\rm tot}(x) . \qquad (1)$$

Here $x$ denotes the position on, and $L$ the length of, the chromosome. Since the value of parameter (1) fluctuates between samples, the total number of SNPs is no longer Poisson distributed, except in the case of very frequent recombination where the variance of this parameter tends to zero. These properties of the coalescent are reviewed in Hudson (1990). For more recent reviews see Nordborg (2001) and Nordborg and Tavaré (2002).



In this paper we investigate local SNP statistics: local spatial fluctuations in $T_{\rm tot}(x)$ due to recombination (Hudson, 1990) may give rise to local variations in the SNP density. Here we study the implications of this idea for the analysis of experimental SNP data.

Specifically, we address the following five questions. How significant is SNP clustering caused by recombination? How does the clustering depend on the parameters of the model (the sample size, the mutation rate, and the recombination rate)? On which length scales are such clusters expected? How does the clustering depend on the length scale on which it is observed? Finally, can recombination alone account for the clustering of SNPs observed in mice (Lindblad-Toh et al., 2000), or in the human genome (The International SNP Map Working Group, 2001)? In the following these questions are answered by analyzing coalescent simulations.

Model and methods

We use coalescent simulations under the neutral infinite-sites model to generate allele samples (Hudson, 1990; Nordborg, 2001). As usual, this model incorporates mutation (with rate $\theta = 4N_e u$) and reciprocal recombination (with rate $R = 4N_e r$, where $r$ is the probability of a recombination event per generation per sequence).

The coalescent process generates genealogies for all sites of the $n$ sequences in a given sample. In the absence of recombination, these genealogies are identical for all sites $x$. In particular, the total time to the most recent common ancestor $T_{\rm tot}$ is the same for all sites. For a given genealogy, mutations are generated as a Poisson process with rate $\lambda = \theta T_{\rm tot}/2$. This implies that the density of SNPs along the genome is uniform: in this case SNPs do not cluster.

If the recombination rate is non-zero, the total time to the most recent common ancestor, $T_{\rm tot}(x)$, varies as a function of the position $x$ (Hudson, 1990). This corresponds to fluctuating local mutation rates $\lambda(x) = \theta T_{\rm tot}(x)/2$. In the presence of recombination, the distribution of SNPs is thus determined by a Poisson process in $x$ with fluctuating rates $\lambda(x)$. Figure 1 shows such a process for realizations of $\lambda(x)$ corresponding to three different sets of parameter values. The fluctuating rates $\lambda(x)$ are shown as solid lines. Note that $\lambda(x)$ is constant over segments of the chromosome which are identical by descent (called MRCA segments in the following). The figure illustrates possible local clustering of SNPs as a consequence of local variation in $\lambda(x)$ due to recombination. While the density of SNPs in the top and bottom panels is uniform, the middle panel exhibits clustering in regions of high $\lambda(x)$.

In the remainder of this paper, local clustering such as that exhibited in the middle panel of Figure 1 is described quantitatively. It is customary in experimental SNP surveys to count SNPs in bins of length $l$. Such a bin might, for example, correspond to a sequence tagged site (STS), or some arbitrarily chosen stretch of sequence. The mean number of SNPs per bin is then

$$\mu_l = \frac{1}{N_{\rm bins}} \sum_{j=1}^{N_{\rm bins}} n_j(l) \qquad (2)$$

and its variance

$$\sigma_l^2 = \frac{1}{N_{\rm bins}} \sum_{j=1}^{N_{\rm bins}} \left[ n_j(l) - \mu_l \right]^2 , \qquad (3)$$

where $n_j(l)$ is the number of SNPs in bin $j$ and $N_{\rm bins}$ is the total number of bins surveyed (cf. Figure 2). In some SNP studies, the bins are arranged contiguously along the chromosome (as depicted in Figure 2); in others the bins are randomly distributed, or equidistributed but non-contiguous.
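As a minimal sketch of how the mean and variance of SNP counts per bin might be computed for contiguous bins, consider the following Python function. The function name, the argument names and the two toy position lists are illustrative assumptions, not part of the original analysis; the variance is normalized by the number of bins.

```python
def clustering_ratio(positions, chrom_length, bin_length):
    """Return (mean, variance, variance/mean) of SNP counts in contiguous bins.

    positions: SNP coordinates in [0, chrom_length).
    The mean must be non-zero, i.e. at least one SNP must fall into a bin.
    """
    n_bins = int(chrom_length // bin_length)
    counts = [0] * n_bins
    for x in positions:
        j = int(x // bin_length)
        if 0 <= j < n_bins:          # ignore SNPs beyond the last complete bin
            counts[j] += 1
    mean = sum(counts) / n_bins
    var = sum((c - mean) ** 2 for c in counts) / n_bins
    return mean, var, var / mean


# Hypothetical toy data: four SNPs on a chromosome of length 40, bins of length 10.
evenly_spread = [5, 15, 25, 35]   # one SNP per bin
clumped = [1, 2, 3, 4]            # all SNPs in the first bin
print(clustering_ratio(evenly_spread, 40, 10))  # (1.0, 0.0, 0.0)
print(clustering_ratio(clumped, 40, 10))        # (1.0, 3.0, 3.0)
```

Both toy samples have the same mean count per bin; only the variance, and hence the ratio, distinguishes the clumped from the evenly spread arrangement.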

We compare empirical values for $\sigma_l^2/\mu_l$ with results of coalescent simulations. In these simulations we determine the ensemble average (denoted by $\langle\sigma_l^2\rangle$ in the following) and the distribution of $\sigma_l^2$ over random genealogies with mutations. We keep the total number of mutations, $S$, fixed to the empirical value. The local rates $\lambda(x)$ are then given by

$$\lambda(x) = (S/L)\, T_{\rm tot}(x)/2 , \qquad (4)$$

and the value of $\mu_l$ is constant between different realizations of the ensemble. One has $\mu_l = l/\Delta$ where $\Delta = L/S$ is the mean distance between neighboring SNPs$^1$. For uniformly distributed SNPs generated with an $x$-independent rate, $\sigma_l^2/\mu_l = 1$. In the case of fluctuating rates

$$\sigma_l^2/\mu_l > 1 \qquad (5)$$

is expected, since spatial fluctuations of $\lambda(x)$ give rise to an increased "compressibility" of the sequence of SNPs. In other words, $\sigma_l^2/\mu_l$ measures the "compressibility" of the sequence of SNPs: the larger $\sigma_l^2/\mu_l$, the more significant SNP clustering is on scales $l$ and larger.

$^1$ If $S$ fluctuates from sample to sample, $\Delta = L/\langle S\rangle = \left[\theta \sum_{k=1}^{n-1} k^{-1}\right]^{-1}$. Here $n$ is the sample size.

To speak meaningfully about SNP clustering, it is necessary that $\Delta$ is much smaller than $L$. This is the case considered in the following. In addition, we assume that bins are much shorter than the chromosome on which they are placed, i.e., we make the following assumptions

$$l \ll L \quad {\rm and} \quad \Delta \ll L . \qquad (6)$$

It is clear (Figure 1) that the statistical properties of $\sigma_l^2/\mu_l$ crucially depend on how rapidly the rate $\lambda(x)$ fluctuates as a function of $x$. It is convenient to define a length scale $\ell_{\rm seg}$ as the ratio of $L$ and the (average) number of jumps of $\lambda(x)$ along the total length of the chromosome. This length scale corresponds to the average MRCA segment length in Figure 1. Here the average is over the chromosome for a given realization of $\lambda(x)$ as well as over an ensemble of such realizations; $\ell_{\rm seg}$ depends on the recombination rate $R$, the sample size $n$, and $L$ (Griffiths and Marjoram, 1997). The relative sizes of the mean spacing $\Delta$ between neighboring SNPs, of the bin size $l$, of the chromosome length $L$, and of the average MRCA segment length $\ell_{\rm seg}$ will play a crucial role in determining SNP clustering.

In the following section, we analyze local SNP clustering in the model described above. We determine the significance of the four length scales $\Delta$, $l$, $\ell_{\rm seg}$, and $L$ for the statistics of the observable $\sigma_l^2/\mu_l$ and analyze for which parameter values $\theta$, $R$, and on which length scales SNP clustering due to recombination is expected to be most significant. In the final section, we discuss the implications of our results in relation to genome-wide surveys of SNPs in mice and humans.

Analysis of SNP clustering 

Characterization of the spatial fluctuations of $\lambda(x)$: For a given genealogy under the neutral infinite-sites model with recombination, SNPs are distributed according to an inhomogeneous Poisson process, that is, according to a Poisson process with a rate $\lambda(x)$ varying along the chromosome (see Figures 1b and c). Given the function $\lambda(x)$, the probability of observing $n(l) = k$ SNPs in bin $[0, l]$ is

$$P\big(n(l) = k \,\big|\, \lambda(x)\big) = \frac{1}{k!} \left( \int_0^l dx\, \lambda(x) \right)^k \exp\left( - \int_0^l dx\, \lambda(x) \right) . \qquad (7)$$

Moreover, given the function $\lambda(x)$, counts of SNPs in non-overlapping bins are statistically independent.
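To make the inhomogeneous Poisson process concrete, the following sketch draws SNP positions for a given piecewise-constant rate. It is an illustration only: the segment boundaries and rates are hypothetical numbers chosen by hand, not the output of a coalescent simulation, and the function name is an assumption.

```python
import random


def sample_snps(segments, rng):
    """Draw SNP positions from a Poisson process with piecewise-constant rate.

    segments: list of (start, end, rate) tuples covering the chromosome, where
    rate is the expected number of SNPs per base pair on [start, end).
    """
    positions = []
    for start, end, rate in segments:
        if rate <= 0:
            continue                          # no mutations on this segment
        # Exponentially distributed gaps between events give a Poisson process;
        # restarting at each boundary is valid because the gaps are memoryless.
        x = start + rng.expovariate(rate)
        while x < end:
            positions.append(x)
            x += rng.expovariate(rate)
    return positions


rng = random.Random(1)
# Hypothetical rates: two segments with a long coalescent time flank a short one.
segments = [(0, 1000, 0.02), (1000, 2000, 0.001), (2000, 3000, 0.02)]
snps = sample_snps(segments, rng)
print(len(snps), "SNPs; expected about", 0.02 * 1000 + 0.001 * 1000 + 0.02 * 1000)
```

Because the rate is constant within a segment, the counts in non-overlapping bins are independent once the segments are fixed; correlations only appear after averaging over random segment configurations.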

Theoretical predictions are computed as ensemble averages over random genealogies, corresponding to averages over random functions $\lambda(x)$. These ensemble averages introduce correlations in the combined process. Such correlations may be weak, but they can be long-ranged. Their range is determined by the length scale on which the random rate $\lambda(x)$ varies. As Figure 1 shows, $\lambda(x)$ is a piecewise constant function: along an MRCA segment the rate is constant, and it varies between MRCA segments. The three panels in Figure 1 correspond to the three cases $\ell_{\rm seg} = L$ (a), $\Delta < \ell_{\rm seg} < L$ (b), and $\ell_{\rm seg} < \Delta$ (c). The average MRCA segment length depends on the sample size $n$ as well as on the recombination rate $R$. According to Griffiths and Marjoram (1997)

$$\ell_{\rm seg} = L \left\{ 1 + \left[ 1 - \frac{2}{n(n+1)} \right] R \right\}^{-1} , \qquad (8)$$

where the denominator denotes the expected number of changes of ancestor along the chromosome [notice that eq. (8) does not describe the expected number of sites with the same MRCA, as pointed out by Wiuf and Hein (1999)]. The length $\ell_{\rm seg}$ describes the scale on which the correlations between local mutation rates $\lambda(x)$ decay. For $x < \ell_{\rm seg}$ these correlations are strong. For $x \gg \ell_{\rm seg}$, on the other hand, the correlation function

$$C(x) = \frac{\langle \lambda(y)\lambda(y+x) \rangle - \langle \lambda(y) \rangle^2}{\langle \lambda^2(y) \rangle - \langle \lambda(y) \rangle^2} \qquad (9)$$

decays to zero. Kaplan and Hudson (1985) have shown that correlations between the times $T_{\rm tot}$ pertaining to two loci in an $m$-locus model decay according to $C^{-1}$ for large values of $C$ (here $C$ is the recombination rate between those loci). Identifying $C$ with an effective recombination rate $R_{\rm eff} = xR/L$ (where $R$ is the recombination rate between the ends of the chromosome) one concludes that $C(x)$ decays as $x^{-1}$ for large $x$. In summary, for any finite value of $R$, correlations between local mutation rates are large on scales up to $\ell_{\rm seg}$, and decay for larger distances. These correlations may affect the fluctuations of $\sigma_l^2$.
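The dependence of the mean MRCA segment length on $R$ and $n$ can be tabulated with a few lines of Python. The sketch below assumes that eq. (8) reads $\ell_{\rm seg} = L/\{1 + [1 - 2/(n(n+1))]R\}$; the function name and the parameter values in the loop are illustrative choices.

```python
def mean_segment_length(L, R, n):
    """Mean MRCA segment length, assuming eq. (8) in the form
    L / (1 + [1 - 2/(n(n+1))] * R)."""
    expected_changes = (1.0 - 2.0 / (n * (n + 1))) * R
    return L / (1.0 + expected_changes)


# Illustrative values: a chromosome of 10^6 bp and a sample of n = 2 sequences.
for R in (0.0, 10.0, 1000.0):
    print(R, mean_segment_length(1e6, R, 2))
```

For $R = 0$ the function returns $L$ itself (a single segment), and for $n = 2$ and large $R$ it approaches $1.5\,L/R$.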

Fluctuations of $\sigma_l^2$: The empirical observable $\sigma_l^2/\mu_l$ is expected to fluctuate from sample to sample$^2$. Since correlations between $\lambda(y)$ and $\lambda(y+x)$ decay as $|x|$ grows, the fluctuations of $\sigma_l^2/\mu_l$ tend to zero in the limit of infinite $L$ (with $\ell_{\rm seg}$ and $l$ constant):

$$\lim_{L\to\infty;\ \ell_{\rm seg},\, l\ {\rm finite}} \sigma_l^2/\mu_l = \frac{\langle n^2(l) \rangle - \langle n(l) \rangle^2}{\langle n(l) \rangle} . \qquad (10)$$

In this limit the process is thus self-averaging (ergodic): eq. (10) implies that the averages of $n(l)$ and of its moments along the chromosome equal the ensemble averages $\langle n(l) \rangle$ of $n(l)$ (and of its moments), see also Pluzhnikov and Donnelly (1996). Here $n(l)$ is the SNP count in one bin of a given sample, and the ensemble average is taken over random genealogies with mutations. For the case where $n = 2$, $\langle n^2(l) \rangle$ can be derived from eq. (15) in Hudson (1990) by replacing the recombination rate in this equation with $lR/L$. One obtains

$$\sigma_l^2/\mu_l \approx 1 + \frac{2l}{\Delta C^2} \left[ -C + \frac{23C + 101}{2\sqrt{97}} \log\left( \frac{2C + 13 - \sqrt{97}}{2C + 13 + \sqrt{97}} \cdot \frac{13 + \sqrt{97}}{13 - \sqrt{97}} \right) + \frac{C - 5}{2} \log\left( \frac{C^2 + 13C + 18}{18} \right) \right] \qquad (11)$$

with $C = lR/L$. Eq. (11) is approximate (it was derived for an $m$-site model in which each site obeys the infinite-sites assumption, in the limit of $m \to \infty$).
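For numerical work it can help to evaluate the closed-form $n = 2$ expression directly. The sketch below assumes that eq. (11) has the form $1 + \frac{2l}{\Delta C^2}\big[-C + \frac{23C+101}{2\sqrt{97}}\log\big(\frac{2C+13-\sqrt{97}}{2C+13+\sqrt{97}}\cdot\frac{13+\sqrt{97}}{13-\sqrt{97}}\big) + \frac{C-5}{2}\log\frac{C^2+13C+18}{18}\big]$ with $C = lR/L$; the function name and the small-$C$ cut-off are choices made here for illustration.

```python
import math


def variance_to_mean(l, Delta, C):
    """Evaluate the n = 2 variance-to-mean ratio, assuming the form of
    eq. (11) quoted in the lead-in; C = l*R/L is the scaled recombination
    rate across one bin."""
    if C < 1e-4:
        # The bracket behaves like C^2/2 for small C; use the limiting value
        # directly to avoid loss of precision from cancelling terms.
        return 1.0 + l / Delta
    s = math.sqrt(97.0)
    log1 = math.log((2 * C + 13 - s) / (2 * C + 13 + s) * (13 + s) / (13 - s))
    log2 = math.log((C * C + 13 * C + 18) / 18.0)
    bracket = -C + (23 * C + 101) / (2 * s) * log1 + (C - 5) / 2 * log2
    return 1.0 + 2.0 * l / (Delta * C * C) * bracket


# Illustrative values with l = Delta: the ratio falls from about 2 towards 1
# as the recombination rate across a bin grows.
for C in (1e-2, 1.0, 10.0, 1000.0):
    print(C, variance_to_mean(100.0, 100.0, C))
```

The small-$C$ branch returns $1 + l/\Delta$, which is the value the bracket expression approaches as $C \to 0$.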

When $L$ is large (much larger than $l$ and $\ell_{\rm seg}$) but finite, the distribution of $\sigma_l^2/\mu_l$ is expected to be narrow, since $\sigma_l^2/\mu_l$ itself is an average over a large number of (approximately) independent bins. When the sample-to-sample variations of $\sigma_l^2/\mu_l$ are small, theoretical models for SNP clustering are easily tested (and possibly rejected). Empirically determined observables [such as, for example, the variance in the number of loci that differ between pairs of haplotypes in Haubold et al. (2002)] often have broad distributions; it is thus significant that in recent years longer and longer contiguous chromosome segments have been sequenced and locally averaged observables such as $\sigma_l^2/\mu_l$ are now available. In empirical studies usually $l \ll L$. Consequently, it is the ratio $\ell_{\rm seg}/L$ which determines the statistics of $\sigma_l^2/\mu_l$.

$^2$ In our simulations $\sigma_l^2$ does, but $\mu_l$ does not, since $S$ and $L$ are constant.

As pointed out above, recent empirical data for $\sigma_l^2$ were obtained for contiguous non-overlapping bins [as depicted in Figure 2 (The International SNP Map Working Group, 2001)]. In other cases, however, the bins were randomly distributed over the chromosome (Lindblad-Toh et al., 2000), or equally spaced but far apart from each other (The International SNP Map Working Group, 2001). How do the statistics of $\sigma_l^2$ depend on the number and the distribution of bins over the chromosome? The expected value of $\sigma_l^2$ is independent of the distribution of bins [if $l, \ell_{\rm seg} \ll L$ it is approximately given by eq. (11)]. The fluctuations of $\sigma_l^2$, however, critically depend on the number and distribution of bins. The following analysis is performed assuming contiguous bins. When comparing with empirical data, however, confidence intervals for $\sigma_l^2/\mu_l$ were obtained using the empirical number and distribution of bins.

Finally, as $\ell_{\rm seg}$ approaches $L$, the fluctuations of $\sigma_l^2/\mu_l$ are expected to increase. In the absence of recombination ($\ell_{\rm seg} = L$), SNPs are distributed according to a Poisson process and the fluctuations tend to zero (if $l \ll L$).



SNP clustering: We have determined the fluctuations of $\sigma_l^2/\mu_l$ using coalescent simulations for sample size $n = 2$, proceeding in five steps: (1) generate a large number of samples of sequence pairs of length $L$ [the results discussed below correspond to values of $L$ ranging between $10^6$ and $1.6 \times 10^8$ bp]; (2) determine the SNPs within each sample (each pair of sequences); (3) assign the SNPs to contiguous bins as illustrated in Figure 2; (4) calculate $\sigma_l^2/\mu_l$ for each sample; (5) finally, average over samples.
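Steps (3) to (5) can be sketched as follows. The generator `uniform_snps` is a hypothetical stand-in that places a fixed number of SNPs uniformly at random; it is NOT a coalescent simulation and only marks the place where steps (1) and (2) would supply SNP positions. The percentile rule used for the 90% interval is likewise an assumption made for this sketch.

```python
import random


def ratio_for_sample(positions, L, l):
    """Steps (3) and (4): bin the SNPs contiguously, return variance/mean."""
    n_bins = int(L // l)
    counts = [0] * n_bins
    for x in positions:
        j = min(int(x // l), n_bins - 1)   # guard against x == L
        counts[j] += 1
    mean = sum(counts) / n_bins
    var = sum((c - mean) ** 2 for c in counts) / n_bins
    return var / mean


def ensemble_summary(draw_positions, L, l, n_samples, rng):
    """Step (5): ensemble mean and a simple 90% interval (5th to 95th percentile)."""
    ratios = sorted(ratio_for_sample(draw_positions(rng), L, l)
                    for _ in range(n_samples))
    mean = sum(ratios) / n_samples
    lo = ratios[int(0.05 * (n_samples - 1))]
    hi = ratios[int(0.95 * (n_samples - 1))]
    return mean, lo, hi


def uniform_snps(rng, L=100_000, S=2_000):
    """Hypothetical stand-in for steps (1)-(2): S SNPs at uniform positions."""
    return [rng.uniform(0, L) for _ in range(S)]


mean, lo, hi = ensemble_summary(uniform_snps, 100_000, 100, 100, random.Random(0))
print(mean, lo, hi)   # close to 1 for uniformly placed SNPs
```

Replacing `uniform_snps` by a function that returns SNP positions from a coalescent simulation with recombination would give the ensemble averages and confidence intervals discussed in the text.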

In these coalescent simulations, the bin size $l$ was taken to be much smaller than $L$, corresponding to, say, the length of an STS compared to that of a mouse chromosome. Figure 3a shows the results of coalescent simulations in comparison with eqs. (11) and (14). In keeping with the above discussion, eq. (11) is adequate when $\ell_{\rm seg}$ is much smaller than $L$. In this regime, the fluctuations of $\sigma_l^2/\mu_l$ are small (but finite since $N_{\rm bins} = L/l$ is finite). As $\ell_{\rm seg}$ approaches $L$, eqs. (11) and (14) are inappropriate, and the fluctuations increase significantly, as expected.



Three qualitative observations emerge from our simulations: 

1. in the region $\Delta < \ell_{\rm seg} \ll L$, the observed values of $\sigma_l^2/\mu_l$ are larger than unity. If $\ell_{\rm seg}$ is much smaller than $\Delta$, or if $\ell_{\rm seg}$ approaches $L$, $\sigma_l^2/\mu_l \to 1$;

2. for small values of $l$, $\sigma_l^2/\mu_l$ exhibits a plateau for intermediate values of $\ell_{\rm seg}$ (indicated by a dashed line in Figure 3a);

3. for larger values of $l$, the plateau disappears.

These qualitative observations can be understood as follows.

1. In the absence of recombination, where $\ell_{\rm seg} = L$, a uniform SNP distribution is expected. In this case

$$\sigma_l^2/\mu_l = 1 \qquad (12)$$

as pointed out above. Conversely, if $\ell_{\rm seg}$ is much smaller than $l$ and $\Delta$, the Poisson process averages over the fluctuating local rates $\lambda(x)$. One thus expects [see eq. (7)] local uniformity of SNPs with rate

$$\bar\lambda = \langle \lambda(x) \rangle = 1/\Delta . \qquad (13)$$

Again, $\sigma_l^2/\mu_l = 1$. In contrast, in the regime $\Delta < \ell_{\rm seg} \ll L$ significant SNP clustering is observed.

2. Consider the situation $l \ll \ell_{\rm seg} \ll L$. In this regime, since $l$ is much smaller than $\ell_{\rm seg}$, most bins overlap with only one MRCA segment and genealogies within a given bin are identical. Since $\ell_{\rm seg}$ is much smaller than $L$, $\sigma_l^2/\mu_l$ can be calculated assuming independent genealogies for each bin. The result can be obtained from eq. (11) in the limit of $C \to 0$:

$$\sigma_l^2/\mu_l \approx 1 + l/\Delta . \qquad (14)$$

This means that $\sigma_l^2/\mu_l$ exhibits a plateau as a function of $\ell_{\rm seg}$ (its value is equal to $1 + l/\Delta$ and thus independent of $\ell_{\rm seg}$ or $R$). Result (14) is shown as a dashed line in Figure 3a. The plateau is cut off by $l$ for small values of $\ell_{\rm seg}$, and by $L$ for large values of $\ell_{\rm seg}$.

3. There are on average $l/\Delta$ SNPs in a bin of size $l$. If $\ell_{\rm seg}$ is much smaller than $l$, these SNPs are distributed over many MRCA segments. If the counts per MRCA segment were statistically independent, one would expect $\sigma_l^2 \propto \ell_{\rm seg}$, and thus $\langle\sigma_l^2\rangle$ to increase roughly in proportion to $\ell_{\rm seg}$ (eq. (11) shows that there are logarithmic corrections to this simple model). As $\ell_{\rm seg}$ approaches $l$, this increase is cut off; $\langle\sigma_l^2\rangle/\mu_l \to 1$ as $\ell_{\rm seg} \to L$. If $l$ is large ($l \sim L$), there is no plateau.

In summary, $\sigma_l^2/\mu_l$ reflects local SNP clustering. It is expected to be most significant in the regime $\Delta < \ell_{\rm seg} \ll L$. In many organisms $S$ is of the order of $R$. For $n = 2$, eq. (8) implies that $\ell_{\rm seg}$ is then roughly $1.5\,\Delta$. In such cases, recombination alone gives rise to SNP clustering.

This clustering is observed on length scales of the order of $\ell_{\rm seg}$. Eq. (11) shows that its effect on $\langle\sigma_l^2\rangle/\mu_l$ is most clearly seen if $l > \ell_{\rm seg}$ (compare Fig. 3b). This observation has two important consequences: (1) in empirical situations it is advisable to choose the bin size $l$ at least as large as $\ell_{\rm seg}$; (2) deviations from the model considered may be associated with length scales much longer than $\ell_{\rm seg}$. Such deviations will be most clearly seen if the bin size $l$ is equal to or larger than this length. In short: the dependence of $\sigma_l^2/\mu_l$ on $l$ indicates on which length scales clustering of SNPs occurs.

Data Analysis

In the following we discuss the implications of our analysis for the interpretation of SNP data from mouse and human. In both cases we ask whether the neutral model can be rejected.

Mouse data: In their survey of SNPs in mice, Lindblad-Toh et al. (2000) observed that SNPs were not distributed uniformly across the genome. A possible explanation for this is selective breeding, which has certainly taken place in the recent evolution of the mouse strains investigated. On the other hand, the SNP clustering in mice might also be due to recombination alone.

Figure 4 shows the empirically determined value of $\sigma_l^2/\mu_l$ [for M. m. domesticus SNPs, see Lindblad-Toh et al. (2000)] in comparison with coalescent simulations for $\langle\sigma_l^2\rangle/\mu_l$. $L$ was taken to be $1.6 \times 10^8$ bp, corresponding to the average chromosome length. A value for $\ell_{\rm seg}$ can be estimated from the average recombination rate in mice, approximately 0.5 cM/Mb (Nachman and Churchill, 1992). Here we assume an effective population size $N_e = 10000$. The average bin length $l$ is the average length of the sequence tagged sites investigated, i.e. 118 bp, smaller than $\Delta$. SNP clustering on the scale of $l$ is thus expected to be small. Moreover, since $l$ is much smaller than $\ell_{\rm seg}$, $\langle\sigma_l^2\rangle/\mu_l$ is expected to exhibit a plateau as a function of $\ell_{\rm seg}$, at $\langle\sigma_l^2\rangle/\mu_l = 1.12$. While the plateau is found in the simulations (Figure 4), the empirically determined value of $\sigma_l^2/\mu_l$ deviates significantly from neutral expectations (Figure 4; $\sigma_l^2/\mu_l = 1.47$). This increase of $\sigma_l^2/\mu_l$ above the value expected under neutrality is consistent with the earlier conclusion that selection plays an important role in shaping the genome-wide distribution of polymorphisms in mice (Lindblad-Toh et al., 2000). In order to demonstrate this more conclusively, long-range SNP data would be of great interest for two reasons: (1) the larger $l$, the larger the expected deviations from Poisson statistics (ideally, $l$ would be of the order of $\ell_{\rm seg}$). Figure 4 shows the increase in neutral SNP clustering if $l$ is increased to 5 kb. (2) Selection may act on length scales greater than $\ell_{\rm seg}$ and may thus contribute only weakly at very small values of $l$ such as those corresponding to an average STS.
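The ordering of length scales for the mouse data can be checked with a short calculation. This is a rough sketch under stated assumptions: 1 cM is read as a recombination probability of 0.01 per generation, the sample size is taken as $n = 2$ as in the simulations, the mean segment length is assumed to follow $\ell_{\rm seg} = L/\{1 + [1 - 2/(n(n+1))]R\}$, and $\Delta$ is inferred here from the plateau value $1 + l/\Delta = 1.12$ rather than taken from the data.

```python
N_e = 10_000            # effective population size assumed in the text
rate_cM_per_Mb = 0.5    # average recombination rate quoted for mice
L = 1.6e8               # average chromosome length in bp
l = 118.0               # average STS length in bp
n = 2                   # sample size (assumption, as in the simulations)

# Assumption: 1 cM corresponds to a recombination probability of 0.01.
r = rate_cM_per_Mb * (L / 1e6) / 100.0
R = 4 * N_e * r

# Mean MRCA segment length, assuming the form of eq. (8) given in the lead-in.
ell_seg = L / (1.0 + (1.0 - 2.0 / (n * (n + 1))) * R)

# Mean SNP spacing implied by the plateau value 1 + l/Delta = 1.12 (inferred).
Delta = l / (1.12 - 1.0)

print(f"R = {R:.0f}, ell_seg = {ell_seg:.0f} bp, Delta = {Delta:.0f} bp")
```

Under these assumptions $\ell_{\rm seg}$ comes out at roughly 7.5 kb and $\Delta$ at roughly 1 kb, so that $l < \Delta < \ell_{\rm seg} \ll L$, in line with the plateau regime described in the text.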

Human chromosome 6: The distribution of chromosome-wide human SNP data was empirically determined for $l = 460$ bp and $l = 200$ kb by The International SNP Map Working Group (2001). Figure 5a shows our coalescent simulations compared to the empirical data for these length scales. Consider first the case of $l = 460$ bp. In the simulations, $\langle\sigma_l^2\rangle/\mu_l$ exhibits a plateau for intermediate values of $\ell_{\rm seg}$ at $\langle\sigma_l^2\rangle/\mu_l \approx 1.34$ [according to eq. (14)]. This implies that the choice of $\ell_{\rm seg}$ attributed to the empirical value is uncritical. Empirically, $\sigma_l^2/\mu_l = 1.44$. From Figure 5a we conclude: given the degree of SNP clustering observed in the human genome on scales of the order of $l = 460$ bp, the neutral model cannot be rejected with confidence.

The situation for $l = 200$ kb is very different. In this case, the simulation results do not exhibit a plateau. The numerical results indicate that $\langle\sigma_l^2\rangle/\mu_l$ increases roughly in proportion to $\ell_{\rm seg}$ (for $\ell_{\rm seg} \ll l$), as suggested above. Furthermore, the empirical value for $\sigma_l^2/\mu_l$ (labeled (1) in Figure 5a) lies significantly above the values for the neutral infinite-sites model with recombination [the corresponding value of $\ell_{\rm seg}$ was estimated assuming an effective population size of $N_e = 10000$ and a recombination rate of 1 cM/Mb (Pritchard and Przeworski, 2001)]. This deviation is possibly caused by selection: the HLA system, which contains more than 100 genes and spans more than 4 Mb on the short arm of chromosome 6 (Klein et al., 1993), has an exceptionally high SNP density. This is maintained by balancing selection (The International SNP Map Working Group, 2001; O'hUigin et al., 2000). The inset of Figure 5a shows the empirically determined distribution of the number of SNPs per bin, $P(n(l) = k)$. It exhibits a strong tail for large values of $k$, which may be due to selection. By ignoring this tail the estimate (labeled (2) in Figure 5a) for $\sigma_l^2/\mu_l$ is reduced considerably. Given the uncertainty as to which value of $\ell_{\rm seg}$ should be assigned to the data points, one may argue that this second estimate is consistent with the neutral infinite-sites model.

Human X chromosome: Finally, Figure 5b shows the empirical estimate of $\sigma_l^2/\mu_l$ given $l = 200$ kb for the X chromosome. This empirical estimate was reduced by discarding the tail in the distribution $P(n(l) = k)$ for large $k$, as was done in the analysis of the chromosome 6 data. However, both estimates of SNP clustering on the human X chromosome deviated significantly from neutrality. This observation is consistent with the fact that, due to its hemizygosity in males, chromosome X should be affected more by selection than autosomes such as chromosome 6.

Discussion 

In this paper we investigate whether the observations of SNP clustering in mice and humans are compatible with neutral expectations. We chose the variance in the number of SNPs found in equal-length contiguous bins divided by the mean number of SNPs found in each bin as a measure of SNP clustering: $\sigma_l^2/\mu_l$. Since under a Poisson distribution the variance is equal to the mean, $\sigma_l^2/\mu_l = 1$ in the absence of recombination. If SNPs are clustered, $\sigma_l^2/\mu_l > 1$.

Whether or not SNP clustering is significant depends on the relative sizes of the mean spacing between neighboring SNPs, $\Delta$, the mean length of segments with constant time to the most recent common ancestor, $\ell_{\rm seg}$, and the total length of the chromosome, $L$. Specifically, clustering is observed if $\Delta < \ell_{\rm seg} \ll L$. In contrast, if recombination is either very frequent compared to mutation ($\ell_{\rm seg} \ll \Delta$) or very rare ($\ell_{\rm seg} \approx L$), no clustering is observed$^3$.

We have shown that it is essential to consider the effect of the scale on which SNPs are sampled, $l$. In our simulations $l$ describes the length of contiguous non-overlapping bins. This length is short compared to the length of the chromosome, and the corresponding large number of such bins leads to narrow confidence intervals around $\sigma_l^2$ for the biologically relevant parameters. As a result, meaningful comparisons between model and observation can be made. In the case of the mouse, the neutral model is rejected with marginal significance. This conclusion depends on the assumption of a "true" recombination rate for mice. This is difficult to know, but Figure 4 shows that our hypothesis test is quite robust with respect to errors in the estimation of $R$ (and hence $\ell_{\rm seg}$).

In the case of human chromosome 6, the influence of $l$ on the outcome of the neutrality test was striking. Significant SNP clustering was observed for large but not for small $l$ (Figure 5a). For large bins ($l = 200$ kb) the distribution of the number of SNPs per bin had a strong positive skew (Figure 5a, inset). By cutting off the tail of bins containing many SNPs, clustering was reduced to its neutral level.

No such effect of cutting off the tail of SNP-dense bins was observed for the X-chromosome. 
It therefore constitutes the most significantly non-neutral SNP collection among the three data 
sets investigated in this study. 

Factors that might lead to such a rejection of the neutral model include population expansion, population subdivision, and variation in (physical) mutation rate. In the case of the data sets we have investigated, some illuminating biology pertaining to these factors is known. Lindblad-Toh et al. (2000) tested whether unequal physical mutation rates could account for their observation of SNP clustering in mice. Their approach was to resequence 16 STSs with no SNPs and 16 STSs with five or more SNPs from closely related species of mice. They observed that the classification of high-scoring and low-scoring STSs was not reproduced in these other taxa and concluded from this that fluctuations in inherent mutation rates could not account for the observation of significant SNP clustering. The claim that selective breeding has been important in shaping the SNP distribution in mice is plausible, but other factors such as population expansion and subdivision can presumably not be ruled out.

$^3$ Rather than the recombination rate $R$ we use the corresponding length scale $\ell_{\rm seg}$ as our point of reference for discussing SNP clustering; $\ell_{\rm seg}$ can directly be compared to the other length scales of the problem, viz. the length of the bins in which SNPs are sampled in experiments, $l$, as well as $\Delta$ and $L$. Moreover, equation (8) shows that $\ell_{\rm seg}$ is a simple function of $L$, $R$, and the sample size $n$, thereby establishing the link between simulations conditioned on $R$ and our observations.

The situation is slightly different in the case of human chromosomes 6 and X, where deviation from the neutral model was much more pronounced for the sex chromosome than for the autosome. Since all chromosomes have undergone the same history of population expansion and migration, selection seems to be the only explanation for this difference. The hemizygosity in males of the X chromosome, which makes most deleterious mutations dominant in males, fits well with this conclusion.

Figure 1: Recombination leads to spatial fluctuations in local coalescent times, $T_{\rm tot}(x)$ (Hudson, 1990), which in turn cause fluctuations of the local mutation rate $\lambda(x)$ (solid lines). Shown are three realizations of $\lambda(x)$ together with the locations of $S = 50$ SNPs (vertical bars) for $n = 2$ and (a) $R = 0$, (b) $R = 10$, and (c) $R = 1000$.

Figure 2: A chromosome of length $L$ is divided into contiguous bins of length $l$. The number of SNPs in bin $j$ is denoted by $n_j(l)$.

Figure 3: Coalescent results for $\langle\sigma_l^2\rangle/\mu_l$ (open symbols) in comparison with eq. (11) (thick lines). Thin lines indicate 90% confidence intervals. (a) $\langle\sigma_l^2\rangle/\mu_l$ for $n = 2$ and $\Delta/L = 10^{-4}$, as a function of $\ell_{\rm seg}/L$ for $l/L = 10^{-4}$ and $l/L = 10^{-3}$ (o). Also shown are the values of $\sigma_l^2/\mu_l$ from eq. (14) for $l/L = 10^{-4}$ (dashed line). (b) $\langle\sigma_l^2\rangle/\mu_l$ for $n = 2$ and $\Delta/L = 10^{-4}$.

Figure 4: $\sigma_l^2/\mu_l$ for M. m. domesticus SNPs (Lindblad-Toh et al., 2000) (•). Simulation results for $\langle\sigma_l^2\rangle/\mu_l$ are shown, corresponding to $l = 118$ bp (o), the average read length in Lindblad-Toh et al. (2000), and corresponding to $l = 5$ kb (◇). In addition, the results according to eq. (11) (thick lines), and, for $l = 118$ bp, 90% confidence intervals (thin lines) are shown.

Figure 5: (a) Variance of SNPs for chromosome 6 in the human genome. Empirical data for bin sizes $l = 460$ bp (•) and $l = 200$ kb (♦) are determined from the data provided by The International SNP Map Working Group (2001). The data points labeled by (1) and (2) differ by a choice of cut-off (see text). Also shown are the mean results of coalescent simulations corresponding to $l = 460$ bp (o) and $l = 200$ kb (◇) and their 90% confidence intervals (thin lines), compared to theoretical expectations from eq. (11) (thick lines). The inset shows the empirical distribution of $n_j(l)$ corresponding to $l = 200$ kb. (b) Variance of SNPs for the human X chromosome. Empirical data for $l = 200$ kb (•) were determined from the data provided by The International SNP Map Working Group (2001). Also shown are mean results of coalescent simulations corresponding to $l = 200$ kb (o), and their 90% confidence intervals (thin lines) compared to theoretical expectations from eq. (11) (thick line).